Abstract. Graph kernels are usually defined in terms of simpler kernels
over local substructures of the original graphs. Different kernels consider
different types of substructures. However, in some cases they have similar
predictive performances, probably because the substructures can be
interpreted as approximations of the subgraphs they induce. In this paper,
we propose to associate to each feature a piece of information about
the context in which the feature appears in the graph. A substructure
appearing in two different graphs will match only if it appears with the
same context in both graphs. We propose a kernel based on this idea that
considers trees as substructures, and where the contexts are features too.
The kernel is inspired by the framework in [7], even if it is not part
of it. We give an efficient algorithm for computing the kernel and show
promising results on real-world graph classification datasets.
1 Introduction

In many application domains data can be naturally represented in a structured
form, e.g. in Chemoinformatics [1] or in natural language processing [3]. For this
reason, in the last few years interest has grown in machine learning techniques
applicable to data represented in structured (non-vectorial) form [14,9]. When
dealing with machine learning for graph-structured data, kernel methods are one of
the most popular approaches. It suffices to use a kernel for graphs
together with any kernelized learning algorithm (e.g. SVM, SVR, KPCA, ...)
to obtain a powerful, ready-to-use learning algorithm with strong theoretical
bounds on its generalization performance. The predictive performance
of the resulting learning procedure strongly depends on the particular kernel 
choice. The design of efficient graph kernels is not a trivial task, because several
graph operations (e.g. testing graph isomorphism) are not known to be efficiently computable.
The idea is to design kernels that are as expressive as possible, in order
to have a small information loss. Several alternatives have been proposed in the
literature. However, it is difficult to state a priori which kernel will perform better
in a specific task, because most of the existing kernels consider different
approximations of the same local structures. In this paper, we propose a method to
enrich the feature space of a kernel with contextual information, i.e. we attach
to a feature a piece of information about the topology of the graph in which that
feature appeared. We apply this idea to the ODD kernel [7], and we define as the
context of a feature another feature from the same kernel. We give an efficient
algorithm for the kernel computation, and experimentally evaluate our proposal
on five real-world datasets.


2 Definitions and notation 

Let $G = (V_G, E_G, L_G)$ be a graph, where $V_G$ is the set of vertices (or nodes),
$E_G \subseteq \{(v_i, v_j) \mid v_i, v_j \in V_G\}$ is the set of edges and $L_G : V_G \to \Sigma$ is a labeling
function mapping each vertex to an element in a fixed alphabet $\Sigma$.

A graph is undirected if $(v_i, v_j) \in E_G \Rightarrow (v_j, v_i) \in E_G$, otherwise it is directed.
A walk $w(u, v)$ in a graph is a sequence of nodes $v_1, \ldots, v_n$ s.t. $(v_i, v_{i+1}) \in E_G$
and $v_1 = u$, $v_n = v$. The length of a walk $|w(u, v)|$ is defined as the number of
edges in the walk. A cycle is a walk where $v_1 = v_n$. A graph is acyclic if it does
not contain cycles. A DAG is a directed acyclic graph. A path is a walk with no
repeated nodes, i.e. where $\forall i, j \in \{1, \ldots, n\},\ i \neq j \Rightarrow v_i \neq v_j$. A shortest path
$sp(u, v)$ between two vertices $u, v \in V_G$ is a path with the minimum length that starts from
$u$ and ends in $v$. Note that shortest paths are not unique, but their length
$|sp(u, v)|$ is. $n\_sp(u, v)$ is a function returning the number of such shortest paths.
A rooted DAG $D$ is a DAG in which one vertex $r$ has been designated as the
root. The root has no incoming edges, i.e. $\nexists u \in V_D$ s.t. $(u, r) \in E_D$. The function
$r(D)$ returns the root of a rooted DAG.

A (rooted) tree is a rooted DAG where for each node there exists exactly one
path connecting the root node to it. The children $children(v)$ of a node $v \in V_T$
in a tree $T$ are all the nodes $u \in V_T$ s.t. $(v, u) \in E_T$. The number of children,
or out-degree, of a vertex $v$ is $\rho(v)$. Similarly, we say that $v$ is the parent of
$u$. $ch_i(v, G)$ is the function returning the $i$-th child of $v \in V_G$ (according to a
particular order).

A proper subtree rooted at $u \in V_T$ of a tree $T$ is the subtree that comprises
$u$ and all its descendants. We will refer to it as $\overset{u}{\triangle} \in T$. We define $T_j(v, G)$ as a
function returning the tree-visit of a graph $G$, rooted at $v$ and limited at height $j$.
Note that this tree-visit is the shortest-path tree between $v$ and any $u \in V_G$ s.t.
$|sp(v, u)| \le j$. Moreover, we denote with $T(v, G)$ the tree-visit at the maximum
possible height, i.e. $T_\infty(v, G) = T_{diam(G)}(v, G)$, where $diam(G)$ is the diameter
of a graph, i.e. the length of the longest shortest path between two vertices.

A DAG-visit of a graph $G$, $DAG_j(v, G)$, is defined as the DAG of the shortest
paths of length up to $j$ starting from $v$. The main difference between $DAG(v, G)$ and $T(v, G)$ is
that the number of nodes in the former is bounded by $|V_G|$ while in the latter it is
not. We assume the nodes in $T_j(v, G)$ or $DAG_j(v, G)$ to be ordered according to
the lexicographic order between the node labels (in case two nodes have the same
label, the ordering is recursively induced from the children). Such an ordering
has been proven to be well-defined in [7] for DAGs. Since trees are a special
case of DAGs, the ordering relation is well-defined on trees as well. For ease of
notation, when clear from the context, the link to the graph $G$ will be omitted
from the above-mentioned functions.
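
To make the DAG-visit and the function n_sp concrete, the following is a minimal Python sketch that builds the shortest-path DAG rooted at a node with a breadth-first search. The function name `dag_visit` and the adjacency-dictionary representation of an undirected, simple graph are assumptions made for illustration; they are not taken from [7].

```python
from collections import deque


def dag_visit(adj, v, h):
    """Shortest-path DAG rooted at v, limited to depth h.

    adj maps each node to a list of neighbours (undirected, simple graph).
    Returns (children, depth, n_sp), where children[u] lists the successors
    of u in the DAG, depth[u] = |sp(v, u)| and n_sp[u] is the number of
    distinct shortest paths from v to u.
    """
    depth = {v: 0}
    n_sp = {v: 1}
    children = {v: []}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if depth[u] == h:
            continue  # do not expand nodes at the maximum depth
        for w in adj[u]:
            if w not in depth:  # first time w is reached
                depth[w] = depth[u] + 1
                n_sp[w] = 0
                children[w] = []
                queue.append(w)
            if depth[w] == depth[u] + 1:  # (u, w) lies on a shortest path
                children[u].append(w)
                n_sp[w] += n_sp[u]
    return children, depth, n_sp
```

On a 4-cycle, for instance, the node opposite to the root is reached by two shortest paths: the DAG-visit stores it once, while the corresponding tree-visit would contain it twice. In general, the sum of `n_sp` over the nodes of the DAG gives the number of nodes of the tree-visit.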


3 Graph Kernels 

Most of the existing graph kernels are members of the R-convolution kernels
framework [?]. The idea of this framework is to decompose the original structure
into a set of simpler structures, where an (efficient) kernel is already defined.
For example, the all-subgraphs graph kernel [?] has a feature associated
to each possible graph. However, computing this kernel is NP-hard.
An approach to reduce the computational complexity of the resulting kernel is
to restrict the set of considered substructures of the graph. Different substructures
give rise to different kernels. For example, kernels based on random
walks [?], shortest paths [2], subtree-patterns [?], subtrees [7] or pairs of small
rooted subgraphs [3] have been proposed in the literature. The main drawback of these kernels
is that they consider only local substructures of the original graphs, whose size
is bounded due to computational complexity. For this reason, in
some cases they have similar predictive performances [?], probably because the
different substructures can be interpreted as different, but still similar, approximations
of small subgraphs of the original graph. Enlarging the substructures to
let the kernel consider a larger amount of information increases the computational
burden. We recall that the main challenge while designing graph kernels
is the trade-off between the efficiency and the expressive power of the kernel.
Among the available graph kernels, the NSPDK [5] is the most related to the
proposed kernel. Specifically, in the RKHS of NSPDK, every feature represents
a pair of small rooted subgraphs $S_1$ and $S_2$ of a certain diameter (radius) $r$,
at a certain distance $d$, i.e. where $|sp(r(S_1), r(S_2))| = d$. In a sense, $S_1$ can be seen
as a context for $S_2$ and vice versa.

Let us define a set of Ordered Decomposition DAGs of a graph $G$ limited to the
maximum (user-specified) depth $h$ as $ODD_G = \{DAG_h(v, G) \mid v \in V_G\}$, where we
recall that the nodes in each DAG are ordered according to a recursive relation
looking at the labels of a node and all its descendants. The ODDK kernel [7] is
defined as:

$$ODDK(G_1, G_2) = \sum_{\substack{OD_1 \in ODD_{G_1} \\ OD_2 \in ODD_{G_2}}} \sum_{j=1}^{h} \sum_{l=1}^{h} \sum_{\substack{v_1 \in V_{OD_1} \\ v_2 \in V_{OD_2}}} C_{ST}(r(T_j(v_1)), r(T_l(v_2)))$$

where $C_{ST}()$ is a function that defines the subtree kernel, i.e. a kernel that counts
the number of shared proper subtrees between two trees. This kernel admits
an explicit feature space representation $\phi$ [7]. Let us define a total ordering
between all the possible labeled trees that arise from applying the kernel to
the dataset. Then each feature $\phi_i(G)$ represents the frequency of the $i$-th tree in
the RKHS of the ODD kernel.


4 Adding Contexts to Graph Kernels 


The graph kernels described in the previous section extract local patterns of the
graph as features, i.e. the feature itself does not carry any information regarding
where it appears within the graph. The idea we propose in order to increase
the expressiveness of a kernel, while preserving efficiency, is to enrich the local
features (e.g. the features extracted by the ODD kernel) with their contextual
information. The contextual information we are interested in is a description of
the topology of the graph around the extracted feature. Thus, a substructure
that appears in two different graphs will match if and only if it appears within
the same context in both graphs. Considering contextual information, we obtain
kernels that are sparser. In some cases, the resulting kernel may be
more discriminative than the original one. However, in other cases
it may be too sparse to obtain good performance. In the latter case, it
can be beneficial to add the contribution of the new kernel to the original one.
In our experiments, we implement both variants. Note that, with our
proposed approach, the contributions of the contextualized
kernel and of the original kernel can be computed efficiently at the same time.
Given a feature $f$ of the original graph kernel, we want the following property to
hold:

$$\sum_{c \in Contexts(f)} \phi_{f \circ c}(G) = \phi_f(G),$$


where $\phi_f(G)$ is the frequency of a feature $f$ in the RKHS of the original kernel,
and $\phi_{f \circ c}$ is the frequency of $f$ appearing within the context $c$. From the formula
it is clear that for each feature we need to consider also the empty context ($\emptyset$-context),
i.e. the situation in which a feature does not appear in any particular
context, e.g. because it has reached the maximum allowed dimension and we have
no information about its context in the original graph.
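
The property above says that summing the contextualized frequencies of a feature over all its contexts (including the empty one) gives back its frequency in the original kernel. A minimal Python sketch of this marginalization follows; the representation of a contextualized feature as a `(feature, context)` pair, with `None` standing for the empty context, and the name `marginalize_contexts` are assumptions made for illustration.

```python
from collections import Counter


def marginalize_contexts(contextualized):
    """Recover base feature frequencies from contextualized ones.

    contextualized maps (feature, context) pairs to frequencies; the empty
    context is represented by None. The result maps each feature f to the
    sum of its frequencies over all contexts, i.e. phi_f(G).
    """
    base = Counter()
    for (feature, _context), freq in contextualized.items():
        base[feature] += freq
    return base
```

For example, a feature that appears twice in context `c1`, once in context `c2` and once with the empty context has frequency 4 in the original feature space.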

In the remainder of this section, we introduce our proposed kernel, instantiating
the context idea on the ODD kernel. Just as a feature represents a substructure,
we can represent a context for a feature as a substructure of the
graph that incorporates the feature. Therefore, contexts and features can share
the same representation, and so a context for a given feature
can be a feature itself. To compute the contextualized features we only need to
combine a feature with other features representing the context in which the first
feature appears in the graph.

The first important difference between the proposed Tree Context Kernel (TCK)
and ODDK is that, for technical reasons, the former is defined over tree-visits
while the latter is defined over DAG-visits. Note that the number of nodes of a tree-visit $T(v, G)$ of
a graph $G$ can grow exponentially in the size of $G$, while if we consider a DAG-visit
$DAG(v, G)$, each node in the original graph can appear at most once, thus limiting
the size of the resulting structure to at most $|V_G|$ nodes. However, in the next
section we provide an efficient implementation that does not need to store
the tree-visits in memory, but only the DAG-visits. The Tree Context Kernel
can be defined as:


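$$TCK(G_1, G_2) = \sum_{\substack{v_1 \in V_{G_1} \\ v_2 \in V_{G_2}}} \sum_{i=1}^{h} \sum_{j=1}^{h} \Bigg[ C_{ST}(r(T_i(v_1)), r(T_j(v_2))) + \sum_{\substack{\overset{u_1}{\triangle} \in T_i(v_1) \\ \overset{u_2}{\triangle} \in T_j(v_2)}} C_{ST}(u_1, u_2) \sum_{l=1}^{\rho(u_1)} C_{ST}(ch_l(u_1), ch_l(u_2)) \Bigg]$$

where we recall that:

$$C_{ST}(v_1, v_2) = \begin{cases} \lambda \cdot \delta(L(v_1), L(v_2)) & \text{if } v_1 \text{ and } v_2 \text{ are leaves} \\ \lambda \cdot \delta(L(v_1), L(v_2)) \prod_{j=1}^{\rho(v_1)} C_{ST}(ch_j(v_1), ch_j(v_2)) & \text{if } \rho(v_1) = \rho(v_2) \\ 0 & \text{otherwise} \end{cases}$$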

and $\delta$ is the Kronecker delta function. We recall that $C_{ST}(v_1, v_2)$, $v_1 \in T_1$, $v_2 \in T_2$,
is a function that counts the common proper subtrees of two trees. The function
depends on $T_1$ and $T_2$. We decided to follow the original definition of [3],
omitting that dependency for ease of notation.
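
The recursive definition of $C_{ST}$ recalled above can be sketched in a few lines of Python. The sketch assumes that an ordered tree is represented as a `(label, children)` pair and that the Kronecker delta compares node labels; both the representation and the name `c_st` are illustrative assumptions.

```python
def c_st(t1, t2, lam):
    """Subtree-kernel recursion on two ordered, labeled trees.

    A tree is a pair (label, children), where children is a list of trees.
    The value is lam ** (number of nodes) if the two proper subtrees are
    identical, and 0 otherwise.
    """
    label1, children1 = t1
    label2, children2 = t2
    # Different labels or different out-degree: no match.
    if label1 != label2 or len(children1) != len(children2):
        return 0.0
    value = lam
    # Children are compared position by position (ordered trees).
    for child1, child2 in zip(children1, children2):
        value *= c_st(child1, child2, lam)
    return value
```

With `lam = 0.5`, two identical trees with three nodes yield `0.5 ** 3`, while trees that differ in any label or in the number of children of any node yield 0.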

The kernel is positive semidefinite because it is a composition of positive semidefinite
kernels, defined over the ordered tree-visits $T_i(v, G) = T(DAG_i(v, G))$ that
are well defined as shown in [7].

Intuitively, this kernel matches two subtree features $\overset{u_1}{\triangle} \in T_i(v_1, G_1)$, $0 \le i \le h$
and $\overset{u_2}{\triangle} \in T_j(v_2, G_2)$, $0 \le j \le h$ in one of the following cases:

- both $u_1$ and $u_2$ are the root nodes of the respective tree-visits, i.e. $u_1 = v_1$ and $u_2 = v_2$;
- $u_1$ and $u_2$ occur within the same context in both trees, i.e. their parents
generate the same proper subtree.

5 Efficient Implementation 

Algorithm 1 shows the pseudocode to decompose a graph $G$ into its explicit
(sparse) feature vector $\phi$. We will denote with $f$ the map that stores the keys
of the local subtree features, i.e. $f_{u,d}$, $u \in V_G$, $d \in \{0, \ldots, h\}$ is the key of the
subtree rooted in $u$ of height $d$. Similarly, $size$ is a map that stores the size of
each feature, i.e. $size_{u,d}$ is the number of nodes that compose the feature $f_{u,d}$.
Let $\kappa$ be a perfect hash function from strings to integers. Such a function can be
implemented with an incrementally-built hashmap that associates a unique id
to each string. Alternatively, a normal hashing function can be used if we tolerate
some clashes. We define the reserved special symbols “⌈”, “⌋”, “#” and “∘”, which must
not appear in the labels of the graphs and are needed to encode
subtree features into strings. In the following, we will discuss the most sensitive
steps of the algorithm. In line 6 the nodes of the DAG-visit are traversed in a


Algorithm 1 An algorithm for computing the explicit feature space representation
of a graph $G$ according to the kernel $TCK_{ST}$ with maximum (user-specified)
height $h$ and weight factor $\lambda$

$\phi = [0, \ldots, 0]$   ▷ explicit feature space represented as sparse vector
for all $v \in V_G$ do
    $D \leftarrow DAG_h(v, G)$
    $f = \{\}$   ▷ dictionary that stores the features related to a node $u$ and height $d$
    $size = \{\}$   ▷ dictionary that stores the size of each feature
    for all $u \in reverseTopologicalOrder(D)$ do
        for $d \leftarrow 0, \ldots, diam(D) - |sp(v, u)|$ do
            if $d = 0$ then
                $f_{u,0} \leftarrow \kappa(L(u))$
                $size_{u,0} \leftarrow 1$
            else
                $(S_1, \ldots, S_{\rho(u)}) \leftarrow SORT(f_{ch_1(u),d-1}, f_{ch_2(u),d-1}, \ldots, f_{ch_{\rho(u)}(u),d-1})$
                $f_{u,d} \leftarrow \kappa(L(u) \lceil S_1 \# S_2 \# \cdots \# S_{\rho(u)} \rfloor)$
                $size_{u,d} \leftarrow 1 + \sum_{i=1}^{\rho(u)} size_{ch_i(u),d-1}$
                for all $ch \in children(u)$ do
                    $\phi_{f_{ch,d-1} \circ f_{u,d}} \leftarrow \phi_{f_{ch,d-1} \circ f_{u,d}} + n\_sp(v, u) \cdot \lambda$
            if $u = v$ then
                $\phi_{f_{u,d} \circ \emptyset} \leftarrow \phi_{f_{u,d} \circ \emptyset} + \lambda^{size_{u,d}/2}$
return $\phi$


reverse topological order, ensuring that every node is processed before its
parent. In line 7, for each node $u$ of the current DAG-visit $D$, we consider all
the heights for the feature generation. Note that when $d = 0$, $f_{u,0}$ is a feature
(proper subtree) of the tree $T_{|sp(v,u)|}(v)$ and when $d = diam(D) - |sp(v, u)|$, $f_{u,d}$
is a feature of $T_{diam(D)}(v)$, where $diam(D) \le h$. Notice that if $D$ is unbalanced
and we are considering a node $u$ whose $|sp(v, u)|$ is not maximum, then we
consider the feature associated to $u$ at its maximum height many times. In
lines 12-14, the local feature related to the current node and height is generated.
The hashed feature values of the children of the current node at height $d - 1$
are sorted, generating a feature of height $d$ and inducing an order on the
children of every node that is the lexicographic order over the hash values of the
corresponding features. This step allows us not to define any particular ordering
on the nodes of $D$. Then the extracted feature is encoded and finally it is hashed.
Lines 15-16 generate the contextualized features and increment their frequency
in $\phi$ according to a weight term multiplied by $n\_sp(v, u)$. This multiplication
allows us to compute the statistics related to the tree-visit while working on
the smaller (in terms of number of nodes) corresponding DAG-visit. Notice that
$n\_sp(v, u)$ is efficiently computed during the creation of $DAG_h(v, G)$ in a top-down
fashion without any additional cost. Finally, lines 17-18 increment the value
corresponding to the feature with empty context $f_{u,d} \circ \emptyset$. This implementation
returns the explicit sparse feature vector $\phi$, therefore in order to compute the






Table 1 . Accuracy results of the proposed kernels and the considered baselines, in 
nested 10-fold cross validation. 


Kernel/dataset   CAS          GDD          NCI1          AIDS         CPDB
NSPDK            83.6±0.34    74.09±0.91   83.46±0.46    82.71±0.66   76.99±1.15
WL               83.33±0.37   75.29±1.33   84.41±0.49    82.02±0.4    76.36±1.4
ODDK             83.53±0.21   76.99±0.36   85.31±0.26    82.99±0.50   78.44±0.76
TCK              83.53±0.32   79.35±0.45   85.78±0.22    82.88±0.39   76.96±0.96
TCK + ODDK       83.94±0.26   78.03±0.56   85.48±0.182   82.97±0.5    78.89±0.98


kernel function between two graphs it is sufficient to compute the dot product
between the two feature vectors.
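
The feature extraction of Algorithm 1 and the final dot product can be illustrated with the following Python sketch. It is a simplified reading of the algorithm under several assumptions that are not part of the original text: graphs are undirected and given as adjacency dictionaries with string labels; encoded strings are used directly as feature keys instead of the hash function $\kappa$; the characters `[`, `]` and `#` play the role of the reserved symbols and must not occur in labels; a contextualized feature is stored as a `(feature, context)` pair with `None` as the empty context; and the weights are `n_sp * lam` for contextualized features and `lam ** (size / 2)` for features with empty context.

```python
from collections import defaultdict, deque


def tck_features(adj, labels, h, lam):
    """Sparse feature vector of one graph, loosely following Algorithm 1."""
    phi = defaultdict(float)
    for v in adj:
        # Build DAG_h(v, G) by BFS, tracking depths and shortest-path counts.
        depth, n_sp, children, order = {v: 0}, {v: 1}, {v: []}, []
        queue = deque([v])
        while queue:
            u = queue.popleft()
            order.append(u)
            if depth[u] == h:
                continue
            for w in adj[u]:
                if w not in depth:
                    depth[w], n_sp[w], children[w] = depth[u] + 1, 0, []
                    queue.append(w)
                if depth[w] == depth[u] + 1:
                    children[u].append(w)
                    n_sp[w] += n_sp[u]
        diam = max(depth.values())
        f, size = {}, {}
        # Reversed BFS order: every node is processed before its parents.
        for u in reversed(order):
            for d in range(diam - depth[u] + 1):
                if d == 0:
                    f[u, 0] = str(labels[u])
                    size[u, 0] = 1
                else:
                    kids = sorted(f[c, d - 1] for c in children[u])
                    f[u, d] = str(labels[u]) + "[" + "#".join(kids) + "]"
                    size[u, d] = 1 + sum(size[c, d - 1] for c in children[u])
                    for c in children[u]:
                        # Child feature in the context of its parent feature.
                        phi[f[c, d - 1], f[u, d]] += n_sp[u] * lam
                if u == v:
                    # Feature rooted at the root of the visit: empty context.
                    phi[f[u, d], None] += lam ** (size[u, d] / 2)
    return phi


def kernel(phi1, phi2):
    """Dot product between two sparse feature vectors."""
    if len(phi1) > len(phi2):
        phi1, phi2 = phi2, phi1
    return sum(value * phi2.get(key, 0.0) for key, value in phi1.items())
```

Since the keys depend only on labels and structure, two isomorphic graphs with different node identifiers obtain the same feature vector. Iterating over the smaller of the two dictionaries in `kernel` keeps the cost of the dot product proportional to the number of non-zero features of the sparser vector.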

6 Experimental results 

We measured the predictive performance of TCK and other state-of-the-art kernels
on the following real-world datasets: AIDS, CAS, CPDB, GDD and NCI1.
Each dataset represents a binary classification problem and is composed of labeled
graphs with no self-loops. The AIDS, CAS, CPDB and NCI1 datasets are
collections of chemical compounds represented as graphs, with nodes labeled according
to the atom type and edges that represent the bonds. The GDD dataset
is composed of proteins represented as graphs, where the nodes represent amino
acids and two nodes in a graph are connected by an edge if they are less than
6 Å apart. The largest datasets are CAS and NCI1 with more than 4000 graphs,
and the smallest is CPDB with 684 instances.

Since we cannot know in advance whether the sparsity is beneficial for a particular
task, we chose to test two versions of the proposed kernel. The first
version (TCK) considers only contextualized features, while the second version
(TCK + ODDK) combines TCK with the base (non-contextualized) kernel,
ODDK in our case. Note that TCK + ODDK can be computed with a slight
modification of Algorithm 1; thus the computational complexities of the two
versions of the proposed kernel are the same. We compare the proposed kernels
with the NSPDK kernel [5], the Fast Subtree Kernel (FS) [12], and the original
version of the ODDK based on the subtree kernel [7]. To assess the predictive
performances of the different kernels, we used a nested 10-fold cross validation:
within each of the 10 folds, another 10-fold cross validation is performed over
the corresponding training set in order to select the best parameters for the current
fold. Thus, the parameters are optimized on the training dataset only. The
whole process has been repeated 10 times using different random data splits.
The parameter space for both versions of TCK and ODDK was restricted to
the following values: $h \in \{1, 2, \ldots, 10\}$ and $\lambda \in \{0.1, 0.5, 0.8, 0.9, \ldots, 1.5, 1.8\}$.
The parameter $h$ of the FS kernel was restricted to $h \in \{1, 2, \ldots, 10\}$, and for
the NSPDK the values $h \in \{1, 2, \ldots, 8\}$ and $d \in \{1, 2, \ldots, 7\}$ were considered.
The SVM solver had the $C$ parameter ranging in $C \in \{10^{-4}, 10^{-3}, \ldots, 10^{3}\}$.
Table 1 reports the averaged accuracy results of our experiments with the corresponding
standard deviations. At a first glance, it is clear that in almost all the










Fig. 1. Computational time (in seconds) required for the Gram matrix computation
of the considered kernels on the CAS dataset, with different parameters.


considered datasets, one of the two proposed kernels is the best performing
among all the considered kernels, with the only exception of the AIDS dataset.
Looking at the results in more detail, in two datasets (GDD, NCI1) both versions
of TCK perform better than the others. If we consider the CAS dataset,
the performance of the worst of the proposed kernels is comparable with the
best kernel among the baselines (NSPDK). In the CPDB dataset the worst of
the proposed kernels is worse than the best kernel among the baselines (ODDK),
but it is still competitive, as in the AIDS dataset, where the proposed kernels
are very close to the best one. Let us finally analyze the computational requirements
of our proposed kernel. Figure 1 reports the computational times required
for the Gram matrix computation of the kernels considered in this section on
the CAS dataset. The execution times of the proposed kernel are very close to
the ones of the original kernel. The situation is similar for other datasets, and
thus the corresponding plots are omitted. The results presented in this section
suggest that the introduction of contextualized features is a promising approach,
and that in principle also other kernels can benefit from such an extension.
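
The nested cross-validation protocol used in this section can be sketched as follows. This is only an illustrative skeleton: the callback `evaluate(train_idx, test_idx, params)`, which is expected to train a classifier on the training indices with the given parameters and return its accuracy on the test indices, is a hypothetical placeholder, as are the names `k_folds` and `nested_cv`.

```python
import random


def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n_samples, param_grid, evaluate, k=10, seed=0):
    """Mean outer-fold score of a nested k-fold cross validation.

    For each outer fold, the parameters are selected by an inner k-fold
    cross validation on the outer training set only, and the selected
    parameters are then scored on the held-out outer fold.
    """
    rng = random.Random(seed)
    outer = k_folds(range(n_samples), k, rng)
    scores = []
    for i, test_idx in enumerate(outer):
        train_idx = [s for j, fold in enumerate(outer) if j != i for s in fold]
        inner = k_folds(train_idx, k, rng)

        def inner_score(params):
            total = 0.0
            for a, val_idx in enumerate(inner):
                fit_idx = [s for b, fold in enumerate(inner) if b != a for s in fold]
                total += evaluate(fit_idx, val_idx, params)
            return total / k

        best = max(param_grid, key=inner_score)
        scores.append(evaluate(train_idx, test_idx, best))
    return sum(scores) / k
```

Repeating the call with different values of `seed` corresponds to repeating the whole process on different random data splits; the held-out fold of each outer iteration is never seen during parameter selection.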

7 Conclusions and future work 

In this paper, we proposed a technique to incorporate context information in
kernels that allow for an explicit feature space representation. In particular,
we defined a relationship between the explicit features where one feature can be
considered as the context of another one. We applied our idea to the ODDK
kernel, and slightly modified the kernel definition in order to provide an efficient
algorithm for the computation of the contextualized kernel. We evaluated the
predictive performance of the resulting kernel (in two variants) over five real-world
datasets, and the proposed approach shows promising results. As future
work, we plan to apply the contextualization idea to other state-of-the-art graph
kernels, as well as to kernels for other discrete structures.